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Abstract 

Traditional statistical methods for confidentiality protection of statistical databases 
do not scale well to deal with GWAS (genome- wide association studies) databases espe- 
cially in terms of guarantees regarding protection from linkage to external information. 
The more recent concept of differential privacy, introduced by the cryptographic com- 
munity, is an approach which provides a rigorous definition of privacy with meaningful 
privacy guarantees in the presence of arbitrary external information, although the guar- 
antees come at a serious price in terms of data utility. Building on such notions, we 
propose new methods to release aggregate GWAS data without compromising an in- 
dividual's privacy. We present methods for releasing differentially private minor allele 
frequencies, chi-square statistics and p-values. We compare these approaches on sim- 
ulated data and on a GWAS study of canine hair length involving 685 dogs. We also 
propose a privacy-preserving method for finding genome-wide associations based on a 
differentially-private approach to penalized logistic regression. 

Key Words: chi-squared statistics; contingency tables; differential privacy; 
genome- wide association studies (GWAS); logistic regression; p- values; single nucleotide 
polymorphism (SNP). 

1 Introduction 

In an article that shocked the genetics community, Homer et al. [32] claimed that, under 
certain conditions, they could use statistical methods to "accurately and robustly [resolve]" 
the presence of an individual with known genotype in a mix of DNA samples from which only 
the minor allele frequencies (MAFs) are known. Their approach compared the MAFs of a 
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specific individual to the distribution of MAFs in a reference population and the distribution 
of MAFs in a test population and then used a t-test to assess if the individual was part of 
the test population. 

Although proposed specifically for use in a forensic context and only secondarily for 
breaking privacy, the Homer et al. p2] "attack" appeared to be generally applicable. As 
a reference population one might use the publicly available single nucleotide polymorphism 
(SNP) data from the HapMap project^] which consists of SNP data from 4 populations 
varying in size from 45 to 90 individuals. Note that the HapMap data set does not contain 
any information regarding the health status of the individuals. For the test population 
one might use the cases in genome- wide association studies (GWAS), which contain both 
genotype data and disease status. Before the appearance of the article [12], the averaged 
MAFs of the cases and the averaged MAFs of the controls in a GWAS were typically publicly 
available. 

In response to Homer et al. [12] , Braun et al. [3] showed that their proposed test depends 
heavily on the assumption that the genotypes of the test population, the reference population 
and the specific person under consideration are samples from the same underlying popula- 
tion and that the SNPs used in the study are independent (i.e., that there is no linkage 
disequilibrium present). These assumptions are usually not met in practice, and as a conse- 
quence, the Homer et al. attack lead to a high false-positive rate, see e.g. Braun et al. [3]. 
Others have criticized Homer et al., suggested alternative formulations of the identification 
problem, claimed to strengthen the attack or suggested different ways to protect the data, 
e.g., see 0II31IHE3II3IIE1I1SI221I2I)]. Despite the apparent limitations of the Homer 
et al. attack on the privacy of GWAS participants and the controversial and, we believe, 
exaggerated nature of their statistical claims, NIH immediately removed from open-access 
databases all aggregate results such as values of averaged MAFs over cases and controls, 
chi-square (x 2 )-statistics and p- values (see Couzin [7] and Zerhouni and Nabel [25]). The 
NIH policy remains in effect today. Every researcher, who wants to gain access to any of 
these data sets, needs to go through an elaborate approval process. This is a particularly 
difficult obstacle for computer scientists, mathematicians or statisticians who do not have a 
credible record in GWAS research. 

Here we propose methods which allow for the release of aggregate GWAS data without 
compromising an individual's privacy, and in many ways totally bystep the debate on the 
validity of the claims by Homer et al. [12] and others on the vulnerability of GWAS databases. 
Our GWAS privacy guarantees utilize the concept of differential privacy, recently introduced 
by the cryptographic community (e.g., Dwork et al. [9]). Differential privacy provides a 
rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary 
external information. Our contributions follows: 

• We propose a method for the release of the averaged MAFs for the cases and for the 
controls in GWAS without compromising an individual's privacy. 

1 http://hapmap. ncbi.nlm.nih.gov/ 
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• We compute e-differentially private x 2 -statistics and p-values and provide a differen- 
tially private algorithm for releasing these statistics for the most relevant SNPs. 

• Conditions such as cancer, heart disease, and diabetes are caused by the interaction of 
various genes and possibly the environment. Detecting such interaction among SNPs 
related to a specific phenotype (i.e., epistasis) is a main goal of GWAS. Most methods 
for finding epistasis are based on a two-stage approach: (1) Filtering all SNPs, e.g., 
using x 2 -statistics or a simple logistic regression, to reduce the potentially interacting 
SNPs to a small number; (2) Further examining the loci achieving some threshold 
for interactions. For example, Park and Hastie [19] use a form of penalized logistic 
regression to test for detecting gene-gene interactions on a small number of SNPs. By 
adapting the work of [Tj and [5] to this methodology, we derive a privacy-preserving 
method for GWAS, where both stages in the two-stage approach satisfy e-differential 
privacy. 

Section 2 describes the basic problem and relevant definitions. In Section 3, we present 
methods for releasing e-differentially private MAFs, x 2 -statistics and p- values, and in Section 
4 we evaluate their statistical utility on data based on a simulation study and on a GWAS 
study of canine hair length involving 685 dogs. In Section 5, we propose a differentially- 
private method for finding genome- wide associations based on a penalized approach to logistic 
regression. 

2 Main Definitions and Notation 

In a typical GWAS setting, we study the interaction between various SNPs and a binary 
phenotype, as for example the disease status of an individual. The binary phenotype takes 
values (e.g., non-diseased) and 1 (e.g., diseased). We denote the total number of individuals 
in a GWAS by iV and assume throughout the paper that the number of cases and controls 
is equal, i.e., there are N/2 cases and N/2 controls. This corresponds to the usual setting 
in GWAS and is necessary in order to achieve sufficient power to detect SNPs which are 
associated with a disease. We denote the total number of SNPs in a GWAS by M' and 
the number of SNPs for which we would like to release aggregate data by M. We assume 
that the SNPs are polymorphic with only two possible nucleotides. The SNPs therefore take 
values 0, 1, and 2 representing the number of minor alleles. We summarize the data for each 
SNP in a 3 x 2 contingency table, where the count in cell consists of the number of 
individuals with genotype i and disease status j. We also assume throughout the paper that 
all margins of such a 3 x 2 contingency table are positive. This is motivated by the fact that 
in GWAS usually all SNPs with a MAF smaller than 0.05 are removed from the study. 

Definition 2.1. A randomized mechanism K, is e-differentially private if, for all data sets 
D and D' which differ in at most one individual and for any t el, 

Pr(/C(D) =t) e 
Pr(/C(D') = t) ~ 



3 



Definition 2.2. The sensitivity of a function / : T> N — > IR d , where T> N denotes the set of all 
databases with N individuals, is the smallest number S(f) such that 

\\f(D)-f(D')\\ 1 <S(f), 

for all data sets D,D' £ T> N differing in a single individual. 

Releasing f(D) + b, where b is random noise drawn from a Laplace distribution with 
mean and scale ^p- satisfies the definition of e-differential privacy (e.g., see [9]). This type 
of release mechanism is often referred to as the Laplace mechanism. 

Definition 2.3. The KL divergence between two probability distributions / and g is defined 
by 

r°° f(x) 

D KL (f\\g)= / f( x )log J -^dx. (1) 

For the analysis of the simulation results in Section 3 we use the KL divergence to 
measure the difference between two distributions such as the original x 2- statistic and its 
corresponding e- differentially private version. 

3 Privacy-Preserving Methodology 

In this section we compute the sensitivity of MAFs, x 2_s t a tistics and p-values needed to 
release the private versions of these statistics for each SNP via the Laplace mechanism. We 
also describe an e-differentially private algorithm for the release of the latter two quantities 
for the M most relevant SNPs. 

3.1 Privacy-Preserving Release of Aggregate MAFs 

We now describe a method for releasing the averaged MAFs for the cases and for the controls 
in GWAS which satisfies differential privacy. The true data form a table consisting of the 
MAFs of the cases and the controls for M SNPs; e.g., see Table [T] In the following, we 
compute the amount of Laplace noise we need to add to such a table in order to satisfy 
e-differential privacy. 

Lemma 3.1. The sensitivity of the averaged MAFs of the cases and the controls based on 
N individuals, with N/2 cases and N/2 controls, for M SNPs is ^f. 

Table 1: Table showing the averaged MAFs of the cases and the controls for M SNPs. 



MAF 


SNP 1 


SNP 2 




SNP M 


Cases 


0.29 


0.20 




0.11 


Controls 


0.27 


0.31 




0.10 
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Proof. Without loss of generality, we can assume that the individual, whose genotype we can 
change, belongs to the cases. Denote this individual by j. For a given SNP we denote the 
number of minor alleles of individual i before adding noise by a, and the perturbed counts 
by a\. Note that aj = a\ for all i ^ j. In addition, \aj — a'A < 2. Therefore, for a given SNP 
we can compute the sensitivity of the averaged MAF as follows: 

N/2 

E 



1 

iV/2 



~2 



N/2 



N/2 , 



1 


a j a 'j 


~~ N/2 


2 2 



< 



2 

N' 



This holds for every SNP. As a consequence, for M SNPs the sensitivity is ^M-, namely the 



1-norm of the M-dimensional vector where all entries are 



□ 



Lemma 3A shows that a data release mechanism that adds Laplace noise with mean 
and scale ^ to each cell entry in Table [l] yields e-differential privacy. This result can be 
seen as a special case of Example 3 in [0] where every cell entry is a histogram by itself. 

Similarly, if instead of releasing the averaged MAFs, we want to release M 3 x 2 tables 
containing the counts for each genotype and disease status, the sensitivity would be 2M. 
Therefore, we have to add Laplace noise with mean and scale — to ensure e-differential 
privacy. 



3.2 Privacy-Preserving Release of x 2 -Statistics and p- Values 

In many GWAS settings, researchers report the x 2_s ^atistics and the p-values of the most 
relevant SNPs. We propose a method for releasing these quantities in a differential privacy- 
preserving way, by first computing the sensitivity and then modifying a method proposed in 
PP , for release of frequent itemsets, to release the noisy statistics corresponding to the most 
relevant SNPs. 

Theorem 3.2. The sensitivity of the ^-statistic based on a 3 x 2 contingency table with 
positive margins and N/2 cases and N/2 controls is j^t^- 

Proof. Consider the following 3x2 contingency table with positive margins and N/2 cases 
and controls each: 

Disease Status 





1 


No. Individuals 


a m-a 


With Genotype 1 


b n-b 


2 


N/2-a-b N/2-m-n+a+b 


Total 


N/2 N/2 


with a, b > 0, m, n > 0, a < m, b < n, a + b < N/2, and m + n < N . 


V = {(a,b,m } n)eN 


m > 0, n > 0, a < m, b < 


a + b< N/2, m + n< N}. 



Then we can view the x 2 -statistic as a function 



x 2 --v 



where (a, 6, m, n) gets mapped to the x 2-s tatistic of the corresponding contingency table. 
The sensitivity corresponds to the values of (a,b,m,n) G V fl {a > 1}, which maximize 

\x 2 (a,b,m,n) — x 2 ( a ~ 1,6 + — l,n + 1)|. 

Our approach is to compute the sensitivity by maximizing the directional derivative of 
X 2 {a,b,m,n) in direction (—1/2,1/2,-1/2,1/2). First note that 



X 2 (a,b,m,n) 



(2a -mf (2b -nf 



m 



n 



(2a-m + 2b- nf 
N — m — n 



We then compute the directional derivative of x 2 (a, 6, rn, n) in direction (—1/2, 1/2, —1/2, 1/2). 
It is given by 

2a 2 4a 26 2 46 
m 2 m n 2 n 

Over V H {a > 1} this is maximized by the smallest possible value of a, the largest possible 
value of m, the largest possible value of 6 and the smallest possible value of n. Consequently, 
the sensitivity is given by: 





( 


i 


N/2 


x 2 




N/2 -2 







K 


1 






- X 





N/2-1 
1 



N/2 





which we can easily see to be 



4JV 
N+2- 



□ 



Note that the sensitivity of the x 2 -statistic grows as a function of N, but is asymptotically 
constant. This is interesting since the x 2_ statistic for a table with fixed frequencies grows 
proportional to N. In order to achieve e-differential privacy for releasing the x 2 -statistic for 
a single SNP, we need to add Laplace noise with scale \ to the true x 2 -statistic. Thus 
for increasing N, the perturbed (private) x 2 -statistics get more accurate. 

Before we consider the sensitivity of the p-values, we derive the asymptotic distribution 
of the perturbed x 2 -statistic which is a convolution of its (asymptotic) sampling distribution 
and perturbation. 

Theorem 3.3. Let a x 2 t^st statistic T have the x 2 sampling distribution with 2 degrees of 
freedom and let the perturbation Y ~ Laplace(0, 4/e). Then, the distribution of the perturbed 
X 2 test statistic, X = T + Y , has the following probability density function 



f(x) 



e 

4 e+2 



bexp(f) 



ifx<0 



[(^ + ^)expR)-^exp(-f)] z/x>0 
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and the following cumulative distribution function 

^exp(f) tfx<0 

!-f (^2 + A) ex P(-f)+i^ ex P(-f) 



Proof. Since T and K are independent random variables, the distribution of X is the con- 
volution of the given x 2 an d Laplace distributions. □ 

We show through simulations in Section [4] that the finite sample distribution is well- 
approximated by this asymptotic distribution even for tables with low total count, marginal 
counts or individual counts. This is in contrast to the poor finite sample behavior of the 
X 2 test statistics arising when the noise is added directly to the underlying cell counts (see 
Section [4]); the latter mechanism has been considered by many (e.g., [HI [in])- For related 
simulations that demonstrate the interactive effect of sample size and privacy level e and 
compare asymptotic efficiency of private and non-private estimators for 2x2 tables and the 
corresponding x 2 -statistics, see [23] . 

We now prove that the asymptotic distribution of the perturbed x 2 -statistic arising from 
perturbing the cell counts is the same as for the unperturbed x 2 -statistic, namely a x 2 ~ 
distribution with two degrees of freedom. 

Theorem 3.4. Let denote a 6- dimensional random variable corresponding to the entries 
of a 3 x 2 contingency table based on n individuals. Let Y denote a 6-dimensional random 
variable drawn from Laplace(0, 2 ) . Then the perturbed x 2 -statistic arising from perturbed cell 
counts (X^ + Y) asymptotically has a x' 2 '-distribution with two degrees of freedom. 

Proof. Let Po,Pi,P2,Qo,Qi £ [0,1] such that Po + Pi + P2 = 1 and go + Qi = 1- Under the 
null hypothesis of independence on a 3 x 2 contingency table the data is sampled from a 
multinomial distribution with probability vector p = (poQo, PoQi, PiQo, PiQi, P2Q0, P2Qi) T '■ The 
central limit theorem implies that 



^(—-P) ^(0,£), 
where £ is the covariance matrix of the product multinomial, i.e. 

e = r — pp T 

and T = diag(p). Note that E has rank 2 and therefore also T~^'ET~2. Let Y ~ Laplace(0, 
Slutsky's theorem implies that 

/*(»)+ y \ d 
y/nl Pj ^Af (0,£), 

and therefore that 
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Finally, by invoking the continuous mapping theorem, we prove the claim, namely 



Given the above derived distributions, the researcher can now compute the p-values for 
the test of independence using the perturbed x 2 -statistics (when perturbing the test statistic 
itself or when adding noise at the level of the cell counts). 

We also consider releasing differentially private p-values (without perturbing the counts 
or the related statistic first). We perform a similar sensitivity analysis on the p- values 
corresponding to the x 2 -statistics when assuming a ^-distribution with 2 degrees of freedom 
as null distribution, cf. [2]. 

Theorem 3.5. The sensitivity of the p-values of the ^-statistic for a 3 x 2 contingency 
table with positive margins and N/2 cases and N/2 controls is exp(— 2/3), when the null 
distribution is a \ 2 - distribution with 2 degrees of freedom. 

Proof. Under the null ^-distribution with 2 degrees of freedom, the p-value corresponding 
to a value x of the x 2 -statistic is 



The first derivative in absolute value is maximized by x = 0. Therefore, the sensitivity of 
the p- value is given by a change of 1 unit in a contingency table with x 2 = 0; i- e -; i n a 
contingency table of the form 




□ 



exp(--) 



x > 0. 



a a 



b b 
N/2-a-b N/2-a-b 



where a,b > 0, and a + b < N/2. We therefore need to find a, b which maximize 




b 

N/2 -a 



a 



b 

b N/2-a-b 



a 





a — 1 a 
b + 1 b 
N/2-a-b N/2-a-b 




where a, b > 0, and a + b < N/2. Equivalently, we need to maximize 




a — 1 a 
b + 1 b 
N/2-a-b N/2-a-b 
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over a, b > 0, and a + b < N/2. The corresponding x 2 -statistic is given by 

1 1 

+ 



2a 



2b + 1 



which is maximized by a = b = 1 and results in a x 2_ statistic of 4/3. Consequently, the 
sensitivity of the p- value is exp(— 2/3). □ 

The e-differentially private mechanism for a single SNP would then release a private 
p- value equal to the original value plus Laplace noise with mean zero and scale - exp(— 2/3). 

The sensitivity of the x 2 -statistic corresponds to the most 'dependent' contingency table 
while the sensitivity of the p- value is determined by an 'independent' contingency table. By 
the most 'dependent' (resp. 'independent') contingency table we mean a table which achieves 
the maximal (resp. minimal) x 2 -statistic over all contingency tables with N individuals. The 
maximal x 2 -statistic is N, while the minimal x 2 -statistic is 0. 

Since in practice we are not interested in contingency tables with very large p-values, we 
in effect have overestimated the sensitivity of the p- value, and wish instead to determine the 
sensitivity of the p-value within the range of "interesting" contingency tables. We therefore 
analyze what happens if we project all p- values, which are larger than a given value p*, onto 
p*. Since the x 2 -statistic for a table with fixed marginal frequencies grows in proportion to 
N, we analyze the situation where p* decreases with increasing N, i.e., p* = exp(— iV/c), 
where c is some constant to be specified by the user. Such a p-value corresponds to a table 
with x 2 -statistic 2N / c and can be viewed as a contingency table which is at least N/ c steps 
of Hamming distance 1 away from independence. 

Corollary 3.6. Projecting all p-values which are larger thanp* = exp(— iV/c) onto p* results 
in a sensitivity of 



exp 



N 
c 



exp 



N{2Nc -AN -Ac + c 2 



2c(Nc - 2N - c) 
for any fixed constant c > 3, which is a factor of N/2. 



Proof. The proof is similar to the proofs of Theorem 3^ and Theorem 3J3 We here give an 
overview. The contingency table 



n 5 

jVYc-2) 



N 



N(c-2) 



2c 



2c 



has a x 2 -statistic — and hence a p- value of exp(— N/c). This table has the maximal x 2 ~ 
statistic over all tables which are N/c steps of Hamming distance 1 away from independence, 
i.e., this table is N/c steps away from the following table 



r n 

2q 
JV 

2c 
N(c-2) 

2c 



N_ 

2c 
N{c-2) 

2c J 
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The largest change in % 2 -statistic is achieved by moving one individual from cell (3, 2) to cell 
(1, 2) resulting in the table 





N+c -] 



N(c-2) N(c-2)-2c 



L 2c 



2c 



This new contingency table has x 2 -statistic 

N(2Nc - AN - Ac + c 2 ) 



c(Nc - 2N - c) 



□ 



For large N, 



N(2Nc - AN - Ac + c 2 ) 



2N 



c(Nc-2N-c) c ' 

and the corresponding p- value is of the order of p*. 

In GWAS settings, however, researchers typically provide only the x 2_ statistics or the 
corresponding p- values of the M most significant SNPs. Since the ranking reveals additional 
information, it is not sufficient to add the above computed noise to these statistics in order 
to achieve differential privacy. Bhaskar et al. [1] show in the context of frequent pattern 
recognition how to release the most significant patterns together with their frequencies while 
satisfying differential privacy. We adapt their method by incorporating our results from 



Theorem [3^ and Theorem |3.5| to GWAS, and state the main result of this section: Algorithm 
1 for releasing the private x 2 -statistics (p-values) of the M most relevant SNPs. 

Let M' denote the total number of SNPs in a GWAS and M the number of statistics 
one would like to release. Naively, one might expect that it is necessary to add Laplace 
noise with scale ^-^/r^ for the x 2 -statistics and ^exp(— 2/3) for the p-values. As we see 
in Algorithm 1, however, the Laplace noise only scales with the number of actually released 
statistics M. 



Theorem 3.7. Algorithm^ is e- differentially private. 



Proof. Using the sensitivities computed in Theorem 3^ and Theorem 3J3, the proof follows 
immediately from Theorem 5 in [TJ. □ 



4 Evaluation of Methodology and Results 

We now evaluate the performance of the proposed methods based on data from a simulation 
study and using a GWAS data set consisting of 685 dogs and their hair length. The GWAS 
data for the hair length of dogs has first been presented and studied in j3j and further been 
analyzed in [16]. It consists of 685 dogs, 319 dogs with long hair cLS CclSCS and 364 with 
short hair as controls, and contains 40, 842 SNPs. Cadieu et al. [1] have shown that the long 
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Algorithm 1 e-Differentially Private Algorithm for Releasing the M Most Relevant SNPs 
Input: The x 2_s tatistics (resp. p-values) for all M' SNPs and the number of statistics, 
M, we want to release. 

Output: The M noisy x 2 -statistics (resp. p- values). 

1. Add Laplace noise with mean zero and scale ]f^2 to the x 2 -statistics (resp. Laplace 
noise with mean zero and scale — exp(— 2/3) to the p- values). 

2. Pick the top M SNPs with respect to the perturbed x 2 -statistics (resp. p- values). 
We denote the corresponding set of SNPs by S. 

3. Add new Laplace noise with mean zero and scale ^rj^ to the true x 2 -statistics of 
the SNPs in S (resp. Laplace noise with mean zero and scale — exp(— 2/3) to the 
true p-values) and release these perturbed statistics. 



versus short hair phenotype is associated with a mutation in the fibroblast growth factor-5 
(FGF5 gene) and the largest x 2 -statistic is achieved by a SNP located on chromosome 32 at 
position 7, 100,913, i.e., about 300Kb apart from FGF5. 

We also use the simulations from [16] performed using HAP-SAMPLE [24]. HAP- 
SAMPLE generates the cases and controls by resampling from HapMap. The simulated 
data show linkage disequilibrium and allele frequencies similar to real data. The simulated 
association studies consist of 400 cases and 400 controls with about 10,000 SNPs per individ- 
ual (SNPs typed with the Affy CHIP on chromosome 9 and chromosome 13 of the Phase I /II 
HapMap data). Two SNPs were chosen to be causative and the simulations were performed 
for three different MAFs (0.1, 0.25 and 0.4) and two different models of interaction (additive 
effect and multiplicative effect of the two SNPs). See [IB] for more details. 

For this paper, we omit the simulation results on the statistical utility of e- differentially 
private release of aggregate MAFs. Our results are similar to those reported in the current 
literature on Laplace mechanism for noise addition to histograms or smaller contingency 
tables with proportions (e.g., [9], [23J). Instead, we focus on the release of differentially- 
private x 2 -statistics, p-values and the most relevant SNPs. 



4.1 Asymptotic distribution of the perturbed x 2 -statistic 

We first present results on the asymptotic distribution of the perturbed x 2 -statistic arising 



from adding noise directly to the statistic, as derived in Theorem |3.3[ and evaluate the 
accuracy of the asymptotic approximation. The distribution for e = 0.2 is described in 
Figure 13 and a comparison of three distributions, namely the asymptotic ^-distribution, 
the asymptotic Laplace distribution and their convolution for different values of the privacy 
parameter e are shown in Figure [2j we can observe that the asymptotic distribution of the 
perturbed x 2 -statistic is very similar to the underlying Laplace distribution as expected 
based on the convolution derived in Theorem 13.31 
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Figure 1: Asymptotic distribution of the perturbed \ 2 test statistic for e = 0.2: density 
function (left), cumulative distribution function (middle), and quantile function (right). 




Figure 2: Comparison of the asymptotic sampling distribution (black line), perturbation 
(black dotted line) and its convolution (red line) for e = 0.1 (left), e = 0.2 (middle left), 
e = 0.3 (middle right), and e = 0.4 (right). 
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Through simulations we analyzed at which point the asymptotic approximation seems to 
be accurate for finite samples. It turns out that even for tables with very small cell counts or 
marginal counts, the finite sample distribution of the private x 2 -statistic is well-approximated 
by its asymptotic distribution, although it is well known that the exact distribution of the 
original x 2-s tatistic is very poorly approximated by the ^-distribution for small samples. 
As an example we discuss the following table: 



1 3 
8 12 
41 35 



We ran a Markov chain on the set of contingency tables which have the same margins as 
the above table using tools from Algebraic Statistics, namely elements of a Markov basis as 
moves (e.g., see [S]). At each step (table), we computed the corresponding x 2 -statistic and 
added Laplace noise with scale -. The resulting posterior distribution is an approximation 
to the true distribution of the perturbed x 2 -statistic and corresponds to the black dotted line 
in Figure [3j The asymptotic distribution of the perturbed x 2 -statistic derived in Theorem 



3.3 is shown in red. These plots and additional simulations show that the asymptotic ap- 
proximation is accurate even for tables with a low total count, marginal counts or individual 
cell counts. 

Similarly, we now analyze under which conditions the asymptotic distribution of the 



perturbed x 2 -statistic arising from perturbing the cell counts, as shown in Theorem 3.4 
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Figure 3: Asymptotic distribution of the perturbed x 2 -statistic (red line) and its true distri- 
bution (black dotted line). 
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appears to be accurate for finite samples. As we will see, when adding noise to the cell 
counts instead of the x 2_s tatistic, the asymptotic distribution of the computed statistic is 
only accurate for a very large total cell count. We analyze the following tables, one with a 
total cell count of 10,000 and two with a total cell count of 100,000: 





1400 


1600 




14000 


16000 




1 


3 


(1) 


1900 


1300 


, (2) 


19000 


13000 


, (3) 


26000 


21000 




1700 


2100 




17000 


21000 




23999 


28997 



We again ran a Markov chain on the set of contingency tables which have the same 
margins as the above tables using a Markov basis to move between tables. At each step we 
perturbed the counts by adding Laplace noise with scale - and computed the corresponding 
perturbed x 2 -statistic. The resulting posterior distribution is an approximation to the true 
distribution of the perturbed x 2 -statistic and is shown in Figure [i] for four values of the 
privacy parameter e. Also the true distribution of the unperturbed x 2 -statistic and the \ 2 ~ 
distribution are shown for comparison. Note that a total cell count of 10, 000 is not sufficient 
for a good approximation of the finite sample distribution by the asymptotic distribution. 
For a total cell count of 100, 000 the approximation appears to be accurate as long as the 
individual cell counts and margins are not too small, as in the case of the third table. 



4.2 Differentially-Private x 2 -Statistics 

Based on the results from the previous section, releasing differentially private x 2 -statistics 
versus perturbing cell counts and then computing the perturbed statistics, seems to work 
better on finite (and smaller) samples. Next, we focus on evaluating the statistical utility 
of the proposed release mechanism following Theorem 3.2 We compare the e- differentially 
private x 2 -statistic to the original statistic via KL divergence. We generated 3x2 contingency 
tables with positive margins and N/2 cases and N/2 controls assuming a product-multinomial 
distribution with the following frequencies: 



true 

chi-squared 
epsilon = 0.1 
epsilon = 0.2 
epsilon = 0.3 
epsilon = 0.4 




10 20 30 

Chi-square statistic 

(a) Table (1) 



true 

chi-squared 
epsilon = 0.1 
epsilon = 0.2 
epsilon = 0.3 
epsilon = 0.4 



10 20 30 

Chi-square statistic 
(b) Table (2) 



true 

chi-squared 
epsilon = 0.1 
epsilon = 0.2 
epsilon = 0.3 
epsilon = 0.4 




10 20 30 
Chi-square statistic 

(c) Table (3) 



Figure 4: Exact and asymptotic distribution of the perturbed and unperturbed x 2 -statistic. 
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0.52 
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"0.47 
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0.45 


0.51 


, (d) 


0.29 


0.43 




0.08 


0.24 




0.06 


0.11 



(2) 



For the ^-distribution with 2 degrees of freedom, an observed value of 6 corresponds to a 
Rvalue of exp(— 3) ~ 0.05. The preceding frequency tables correspond to contingency tables 
for which we expect a p- value of 0.05 for 



For example, for iV 
of the form 



a) N = 20, (6) N = 40, (c) N = 80, (d) N = 160. 

200 individuals and underlying frequency table (a) we expect a table 



72 
18 
10 



20 

28 
52 



which has a x 2 -statistic of 60. Therefore, for N = 20 we expect a x 2 -statistic of 6. If we fix 
the number of individuals N, then the x 2 -statistic corresponding to frequency table (a) is 
the largest, namely 8 times the x 2 -statistic corresponding to frequency table (d). 

The choice of the frequency tables in ^ is motivated by the GWAS on the hair length 
of dogs in [1] and our simulations using HAP-SAMPLE. The x 2 -statistic resulting from the 
frequency table (a) is comparable to the x 2_ statistic of the SNP most associated to the 
hair length in dogs (on chromosome 32 at position 7,100,913 in the CanMap data set). 
The x 2 -statistic resulting from the frequency table (c) is comparable to the x 2 -statistic of a 
causative SNP in a simulated association study under the additive model (i.e., main effects 
only model) for MAF = 0.4, and (d) is comparable to a causative SNP under the additive 
model for MAF = 0.25. The frequency table (b) corresponds to an intermediate model for 
a causative SNP with high MAF and was added for consistency. 

For a fixed total number of individuals N, we generated 10,000 tables from the frequency 
tables in ^ and computed the corresponding x 2 -statistics. We also generated 10,000 pri- 
vate ^-statistics according to the Laplace mechanism described following Theorem 3J2 In 
Figure [5] we plotted the KL divergence between the original and the private x 2_ statistics 
for increasing N and for four different levels of privacy. The four plots correspond to the 
four frequency tables in Q. We see that the KL divergence depends on the x 2_ statistic of 
the underlying frequency table, the total number of individuals N, and the privacy level e. 
Since the added noise is asymptotically Laplace(0,4) distributed, the larger the original x 2 - 
statistic, the smaller the KL divergence is. Similarly, a larger number of individuals N leads 
to a larger x 2 -statistic and hence to a smaller KL divergence. The scale of the Laplace noise 
is inverse proportional to the privacy parameter e. Therefore, the smaller e, the larger the 
KL divergence is. These simulations demonstrate that it is possible to release e- differentially 
private x 2 -statistics and maintain good statistical utility in a realistic GWAS setting. 
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Figure 5: KL divergence between the original x 2 -statistic and the private x 2 -statistic based 
on the frequency table (a) left, (b) middle left, (c) middle right, and (d) right. 



4.3 Differentially-Private p- Values 

We did a similar analysis on the p- values following the proposed release mechanism of adding 
Laplace noise according to Theorem 3.5 Based on the frequency tables in we computed 



the KL divergence between the original and private p-values for increasing iV and for four 
different privacy levels. The resulting plots are shown in Figure |6j Similarly to the x 2 ~ 
statistics, the smaller e, the larger the KL divergence is. However, the relation between 
the KL divergence and the number of individuals, resp. the original x 2 -statistic, is reversed 
since, for the ^-distribution with 2 degrees of freedom, the x 2_ st a tistic is proportional to 
the logarithm of the p-value. The larger the x 2 -statistic, the smaller the p-value and hence 
the smaller the signal to noise ratio. The jumps in the figures arise because we project the 
perturbed p- values which fall outside the interval [0,1] to or 1, respectively. Although there 
is a one-to-one correspondence between the x 2_s tatistics and the p-values, the x 2 -statistics 
have a much smaller KL divergence and are therefore better suited for privacy purposes. 

Projecting the p- values onto a region of interest as described in Corollary 3.6| results in 



plots similar to those in Figure pi the plots depend on how much smaller the p- value under 



consideration is compared to 1 in the case of Theorem 3J3 and p* in the case of Corollary 




500 1000 1500 2000 500 1000 1500 2000 500 1000 1500 2000 500 1000 1500 2000 

Sample size Sample size Sample size Sample size 



Figure 6: KL divergence between the original p- values and the private p- values based on the 
frequency table (a) left, (b) middle left, (c) middle right, and (d) right. 
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Type I error Type I error Type I error Type I error 



(a) e = 0.1 (b) e = 0.2 (c) e = 0.3 (d) e = 0.4 

Figure 7: ROC curves for the perturbed p-values. 

Our analysis and the plots in Figure [6] strongly suggest that perturbing the p-values 
to achieve e-differential privacy leads to too much noise. Making inference based on such 
perturbed p-values seems impossible. However, it is a valid question to ask whether there 
might exist a cut-off which could control the Type I & Type II errors. 

We analyze this question by sampling 500 true positives (p-values in [0, 0.05]) and 500 true 
negatives (p-values in [0.05,1]) uniformly and adding Laplace noise with scale exp(— |)/e. 
We represent the simulated data in an ROC plot, where we report for all possible cut-off 
values the resulting Type I and Type II errors. These plots for four levels of privacy, namely 
e = 0.1,0.2,0.3,0.4 are shown in Figure [7} We especially indicate the point corresponding 
to the usual cut-off of 0.05. 

Figure [7] confirms that using the perturbed p-values as a test for independence is not 
much better than a random test, independent of the chosen cut-off. Choosing a cut-off of 
0.05 seems reasonable, but it is anyways impossible to control the Type I & Type II errors. 
An interesting feature in the plots are the long straight lines going from both corners along 
the diagonal. These lines arise since we project the perturbed p-values which fall outside 
the interval [0, 1] to either or 1. These plots show again that the perturbed p-values are 
dominated by these projected 0's and l's rendering a test based on the perturbed p-values 
uninformative. 

4.4 Releasing the M Most Relevant SNPs with Respect to a Spe- 
cific Phenotype 

Practitioners are often interested in finding and releasing the most relevant (i.e., most sta- 
tistically and practically significant) SNPs. Here we analyze what sample size N is needed 
in order to recover the two causative SNPs in the HAP-SAMPLE simulations from the pri- 
vate x 2 -statistics. We chose M = 3 and plotted the frequencies (based on 1,000 private 
X 2 -statistics) for which one or both of the two causative SNPs were among the three highest 
ranked private x 2 -statistics computed according to Algorithm [T] We performed this analysis 
for increasing sample size N and for four different privacy levels. We used the simulated 
HAP-SAMPLE data consisting of around 10,000 SNPs total with two causative SNPs under 
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the additive model with MAF=0.25 and MAF=0.4. The resulting bar charts are shown in 
Figure [§} 

As we expect, a larger value of e (i.e., less noise/less privacy) results in a higher chance 
of releasing the truly causative SNPs. We also observe that the smaller the MAF, the more 
data we need to detect the causative SNPs at a fixed level of e. For example, for e = 0.4, 
Figure [8] shows that for MAF=0.4 we need about 7,500 individuals to detect the causative 
SNPs whereas for MAF=0.25 we need about 10,000 individuals. A smaller MAF corresponds 
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Figure 8: Bar charts representing the frequencies for which one or both of the two causative 
SNPs were among the three highest ranked private x 2_ statistics under the additive model 
with MAF=0.25 (top) and MAF=0.4 (bottom). 
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to a sparser table, and we are in a similar situation to that described in [10J, where it is 
shown that for sparse tables differential privacy requires adding a lot of noise, often with 
the result of impairing statistical inference. Our results support the traditional trade-off: in 
order to detect important effects, we need to either relax the privacy constraint or increase 
the total number of individuals massively. 

An alternative to adding noise to the data we want to release is to add noise to the 
analysis itself. We explain this approach for GWAS in the following section. 

5 Extended Work: Differentially- Private Algorithm for 
Detecting Epistasis 

As we just saw, the sparseness of GWAS data requires an unrealistically large number of 
individuals in each study or a relaxation of the privacy level. In order to deal with sparseness, 
methods have been proposed, where the Laplace noise is added to the analysis directly instead 
of to the output. Another advantage of such an approach is that it allows the analysis of 
models that integrate information across SNPs. Here we present an e-differentially logistic 
regression approach that is directly applicable to GWAS. 

Most methods for detecting epistasis are based on a two-stage approach. First, all SNPs 
are filtered e.g. using x 2 -statistics or p-values, to reduce the potential interacting SNPs to a 
small number. The loci achieving some threshold are then further examined for interactions. 
A widely used test for detecting gene-gene interactions on a small number of SNPs is a 
penalized logistic regression, e.g. the L 2 -regularized logistic regression proposed by Park 
and Hastie [19J. By adapting the work of Bhaskar et al. [1] and Chaudhuri et al. |5J, we 
derive a privacy-preserving method for detecting epistasis, where both stages in the two-stage 
approach satisfy differential privacy. 

We use the first two steps in Algorithm[T]to chose a subset of interesting SNPs of size M in 
a differentially private way. Park and Hastie [19] suggest an Z/2-regularized logistic regression 
in order to detect epistasis within a small subset of SNPs. Chaudhuri et al. [5] demonstrated 
how to perturb the objective function for privacy-preserving machine-learning algorithm 
designs if the loss function and the regularizer satisfy certain convexity and differentiability 
criteria. In the following, we outline how to apply their objective perturbation in order to 
find a differentially private algorithm for detecting epistasis. 

Let y = (yi, ■ ■ ■ ,Un) denote the disease status of the N individuals. Note that in this 
section we encode the diseased status by 1 and the non-diseased status by -1. Let Xi G W +1 
denote the feature vector for the i th individual. The first entry corresponds to the intercept. 
The encoding of the features is explained via an example. We will look at a model with two 
SNPs including their interaction. SNP1 takes the three states 0, 1, and 2, which are encoded 
by 100, 010, and 001. Similarly for SNP2. The interaction term SNPlxSNP2 takes the states 
00, 01, 02, 10, 11, 12, 20, 21, 22 and is encoded by 100000000, 010000000, . . . , 000000001. So 
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an individual with genotype 12, who is not diseased would have 

x = (1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0), y = -1. 

Let K — 1 be the total number of effects in the model (including main and higher-order 
effects). It is important to note that ||^i||2 < K. 

The objective function described in Park and Hastie jl9] is 

N 1 
L(P) = £>g(l + exp(-y t p T x t )) + -/3 T A/3, 

i=i 

where A is of the form (0, A, . . . , A), i.e. 0o is not penalized. They use the Newton- Raphson 
method for the optimization and forward selection and backward deletion steps based on an 
Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC) score to select 
model size and important factors. 

We can apply the approach of Chaudhuri et al. [5j to perturb the objective function 
such that the algorithm satisfies e-differential privacy. We are interested in the following 
perturbed objective function: 

- 1 1 

L piiY (f3) = Ml + exp(- 2/i /3 T x i )) + -(3 T K(3 + -b T /3 } 

i=i 

where b is noise drawn from a distribution with density 

/(6) = -«p(-A;|6| 2 ) 
a 

and k is a constant and a the normalizing constant. 

Following the proposal by Park and Hastie [19] we make use of forward selection and 
backward deletion steps based on an AIC or BIC score to select model size; however, we 
replace the optimization step in their method by Algorithm [2] 

Theorem 5.1. Algorithm^ is e- differentially private. 



Algorithm 2 e-Differentially Private Algorithm for Detecting Epistasis 

Input: The data vectors Xi, y^ where i — 1, . . . , N and parameters e, A, and c. 
Output: The output consists of the noisy effects. 

1. Let e' = e - log(l + + f^J). If e' > 0, then 5 = 0, else 5 = N{e c c %_ 1) - A and 
e' = e/2K. 

2. Draw b from a distribution with density f(b) = ^exp(— ^^). 

3. Compute /3 priv = argmin(L priv (/3) + §6||/3| 2 ). 
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Proof. The proof follows from Theorem 9 in [5], and by taking into account the fact that 
! 2^1 2 — K for our application. □ 

This result allows us to move away from a SNP-by-SNP analysis to an integrated approach 
without relaxing privacy. Applying this method to actual GWAS data is part of ongoing 
work. 

6 Conclusion 

In this paper, we have demonstrated that it is possible, using the formal privacy guarantees 
of differential privacy, for NIH and other GWAS data repositories as well as "GWAS data 
owners" to release at least some genetic data required by practitioners. More specifically, we 
described a privacy-preserving release of aggregate minor allele frequencies and the release 
of differentially-private x 2- statistics and p-values. We also provided a differentially private 
algorithm for releasing these statistics for the most relevant SNPs. 

Our simulations, however, indicate that for bigger and sparse data the release of simple 
summary statistics is problematic and not sufficient from both privacy and utility perspec- 
tives. The release of summary statistics may be at least in part sufficient for traditional 
piecewise SNP-by-SNP analysis. More specifically, our results on finite sample properties of 
differentially-private x 2 -statistics show that adding noise directly to the x 2 -statistic achieves 
the best trade-off between privacy and utility in comparison to adding noise to the p- values 
or cell entries themselves, in particular for tables with small to moderate counts and overall 
samples size. However, we require more complex methodology to deal with more sparse 
data and models that integrate across SNPs to detect epistasis. To address this problem, we 
outlined an e-differentially private algorithm for a specific form of penalized logistic regres- 
sion. This is but one of the newer methods being introduced into the statistical literature 
for GWAS, but we expect that the general strategy suggested here might be adaptable for 
other statistical methods, e.g., for sparse partitioning [21] . 

Since the introduction of differential privacy by [U] , and in particular e-differential privacy, 
many additional variations along with their considerations with respect to statistical analysis 
have been proposed (e.g., more recently [H]). To further improve the privacy-utility tradeoffs 
for GWAS, the future research would consider such alternate mechanisms. 
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